Conversation

@dfukalov
Collaborator

For some targets, the optimization X == Const ? X : Y -> X == Const ? Const : Y can cause extra register usage for the constant in v_cndmask. This patch detects such cases and reuses the register from the compare instruction that already holds the constant, instead of materializing it again.

For SWDEV-506659.
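For illustration (an editorial sketch, not part of the PR), the transform being compensated for looks like this at the source level; the function names and the constant below are made up:

// InstCombine rewrites the first form into the second. On targets where the
// constant cannot be encoded inline, the second form can force the constant
// into an extra register for the v_cndmask select.
float before(float X, float Y) { return X == 0.24265f ? X : Y; }
float after (float X, float Y) { return X == 0.24265f ? 0.24265f : Y; }

As the MIR example later in the thread shows, the fold lets v_cndmask read the compared register instead of a re-materialized copy of the constant.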

@llvmbot
Member

llvmbot commented Mar 13, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Daniil Fukalov (dfukalov)

Changes

For some targets, the optimization X == Const ? X : Y -> X == Const ? Const : Y can cause extra register usage for the constant in v_cndmask. This patch detects such cases and reuses the register from the compare instruction that already holds the constant, instead of materializing it again.

For SWDEV-506659.


Full diff: https://github.com/llvm/llvm-project/pull/131146.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIFoldOperands.cpp (+69-4)
  • (added) llvm/test/CodeGen/AMDGPU/fold-cndmask-select.ll (+171)
diff --git a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
index 91df516b80857..f2857cd381c7e 100644
--- a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
@@ -1411,15 +1411,80 @@ bool SIFoldOperandsImpl::tryFoldCndMask(MachineInstr &MI) const {
       Opc != AMDGPU::V_CNDMASK_B64_PSEUDO)
     return false;
 
+  // Try to find optimized Y == Const ? Const : Z. If Const can't be directly
+  // encoded in the cndmask, try to reuse a register already holding the Const
+  // value from the comparison instruction.
+  auto tryFoldCndMaskCmp =
+      [&](MachineOperand *SrcOp, std::optional<int64_t> SrcImm,
+          unsigned CmpOpcodes[4], AMDGPU::OpName CmpValName) -> bool {
+    // We'll try to process only register operands with known values.
+    if (!SrcImm || !SrcOp->isReg())
+      return false;
+
+    // Find the predicate of the cndmask instruction.
+    MachineOperand *PredOp = TII->getNamedOperand(MI, AMDGPU::OpName::src2);
+    if (!PredOp || !PredOp->isReg())
+      return false;
+
+    MachineInstr *PredI = MRI->getVRegDef(PredOp->getReg());
+    if (!PredI || !PredI->isCompare())
+      return false;
+
+    unsigned CmpOpc = PredI->getOpcode();
+
+    if (CmpOpc != CmpOpcodes[0] && CmpOpc != CmpOpcodes[1] &&
+        CmpOpc != CmpOpcodes[2] && CmpOpc != CmpOpcodes[3])
+      return false;
+
+    // Check if the immediate value of the source operand matches the immediate
+    // value of either the first or second operand of the comparison
+    // instruction.
+    MachineOperand *SubstOp = nullptr;
+    std::optional<int64_t> CmpValImm = getImmOrMaterializedImm(
+        *TII->getNamedOperand(*PredI, AMDGPU::OpName::src0));
+    if (CmpValImm && *CmpValImm == *SrcImm) {
+      SubstOp = TII->getNamedOperand(*PredI, AMDGPU::OpName::src1);
+    } else {
+      CmpValImm = getImmOrMaterializedImm(
+          *TII->getNamedOperand(*PredI, AMDGPU::OpName::src1));
+      if (CmpValImm && *CmpValImm == *SrcImm) {
+        SubstOp = TII->getNamedOperand(*PredI, AMDGPU::OpName::src0);
+      } else {
+        return false;
+      }
+    }
+
+    if (!SubstOp || !SubstOp->isReg())
+      return false;
+
+    LLVM_DEBUG(dbgs() << "Folded " << MI << " into ");
+    SrcOp->setReg(SubstOp->getReg());
+    LLVM_DEBUG(dbgs() << MI);
+    return true;
+  };
+
   MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0);
   MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);
   if (!Src1->isIdenticalTo(*Src0)) {
-    std::optional<int64_t> Src1Imm = getImmOrMaterializedImm(*Src1);
-    if (!Src1Imm)
-      return false;
+    // Try to fold with not-equal comparisons
+    unsigned NECmpOpcodes[4] = {
+        AMDGPU::V_CMP_NEQ_F32_e64, AMDGPU::V_CMP_LG_F32_e64,
+        AMDGPU::V_CMP_NE_I32_e64, AMDGPU::V_CMP_NE_U32_e64};
 
     std::optional<int64_t> Src0Imm = getImmOrMaterializedImm(*Src0);
-    if (!Src0Imm || *Src0Imm != *Src1Imm)
+    if (tryFoldCndMaskCmp(Src0, Src0Imm, NECmpOpcodes, AMDGPU::OpName::src1))
+      return true;
+
+    // Try to fold with equal comparisons
+    unsigned EQCmpOpcodes[4] = {
+        AMDGPU::V_CMP_EQ_F32_e64, AMDGPU::V_CMP_EQ_F64_e64,
+        AMDGPU::V_CMP_EQ_I32_e64, AMDGPU::V_CMP_EQ_U32_e64};
+
+    std::optional<int64_t> Src1Imm = getImmOrMaterializedImm(*Src1);
+    if (tryFoldCndMaskCmp(Src1, Src1Imm, EQCmpOpcodes, AMDGPU::OpName::src0))
+      return true;
+
+    if (!Src0Imm || !Src1Imm || *Src0Imm != *Src1Imm)
       return false;
   }
 
diff --git a/llvm/test/CodeGen/AMDGPU/fold-cndmask-select.ll b/llvm/test/CodeGen/AMDGPU/fold-cndmask-select.ll
new file mode 100644
index 0000000000000..0b7106448bb7c
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/fold-cndmask-select.ll
@@ -0,0 +1,171 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -verify-machineinstrs  < %s | FileCheck %s -check-prefix=GFX9
+; RUN: llc -mtriple=amdgcn -mcpu=gfx1030 -verify-machineinstrs  < %s | FileCheck %s -check-prefix=GFX10
+
+define float @f32_oeq_v_i(float %arg, float %arg1) {
+; GFX9-LABEL: f32_oeq_v_i:
+; GFX9:       ; %bb.0: ; %bb
+; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT:    s_mov_b32 s4, 0x3e7ae148
+; GFX9-NEXT:    v_cmp_neq_f32_e32 vcc, s4, v0
+; GFX9-NEXT:    v_cndmask_b32_e32 v0, v0, v1, vcc
+; GFX9-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX10-LABEL: f32_oeq_v_i:
+; GFX10:       ; %bb.0: ; %bb
+; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-NEXT:    v_cmp_neq_f32_e32 vcc_lo, 0x3e7ae148, v0
+; GFX10-NEXT:    v_cndmask_b32_e32 v0, 0x3e7ae148, v1, vcc_lo
+; GFX10-NEXT:    s_setpc_b64 s[30:31]
+bb:
+  %fcmp = fcmp oeq float %arg, 0x3FCF5C2900000000
+  %select = select i1 %fcmp, float 0x3FCF5C2900000000, float %arg1
+  ret float %select
+}
+
+define float @f32_oeq_i_v(float %arg, float %arg1) {
+; GFX9-LABEL: f32_oeq_i_v:
+; GFX9:       ; %bb.0: ; %bb
+; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT:    s_mov_b32 s4, 0x3e7ae148
+; GFX9-NEXT:    v_cmp_neq_f32_e32 vcc, s4, v0
+; GFX9-NEXT:    v_cndmask_b32_e32 v0, v0, v1, vcc
+; GFX9-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX10-LABEL: f32_oeq_i_v:
+; GFX10:       ; %bb.0: ; %bb
+; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-NEXT:    v_cmp_neq_f32_e32 vcc_lo, 0x3e7ae148, v0
+; GFX10-NEXT:    v_cndmask_b32_e32 v0, 0x3e7ae148, v1, vcc_lo
+; GFX10-NEXT:    s_setpc_b64 s[30:31]
+bb:
+  %fcmp = fcmp oeq float 0x3FCF5C2900000000, %arg
+  %select = select i1 %fcmp, float 0x3FCF5C2900000000, float %arg1
+  ret float %select
+}
+
+define float @f32_one_v_i(float %arg, float %arg1) {
+; GFX9-LABEL: f32_one_v_i:
+; GFX9:       ; %bb.0: ; %bb
+; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT:    s_mov_b32 s4, 0x3e7ae148
+; GFX9-NEXT:    v_cmp_lg_f32_e32 vcc, s4, v0
+; GFX9-NEXT:    v_cndmask_b32_e32 v0, v0, v1, vcc
+; GFX9-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX10-LABEL: f32_one_v_i:
+; GFX10:       ; %bb.0: ; %bb
+; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-NEXT:    v_cmp_lg_f32_e32 vcc_lo, 0x3e7ae148, v0
+; GFX10-NEXT:    v_cndmask_b32_e32 v0, 0x3e7ae148, v1, vcc_lo
+; GFX10-NEXT:    s_setpc_b64 s[30:31]
+bb:
+  %fcmp = fcmp one float %arg, 0x3FCF5C2900000000
+  %select = select i1 %fcmp, float %arg1, float 0x3FCF5C2900000000
+  ret float %select
+}
+
+define float @f32_one_i_v(float %arg, float %arg1) {
+; GFX9-LABEL: f32_one_i_v:
+; GFX9:       ; %bb.0: ; %bb
+; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT:    s_mov_b32 s4, 0x3e7ae148
+; GFX9-NEXT:    v_cmp_lg_f32_e32 vcc, s4, v0
+; GFX9-NEXT:    v_cndmask_b32_e32 v0, v0, v1, vcc
+; GFX9-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX10-LABEL: f32_one_i_v:
+; GFX10:       ; %bb.0: ; %bb
+; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-NEXT:    v_cmp_lg_f32_e32 vcc_lo, 0x3e7ae148, v0
+; GFX10-NEXT:    v_cndmask_b32_e32 v0, 0x3e7ae148, v1, vcc_lo
+; GFX10-NEXT:    s_setpc_b64 s[30:31]
+bb:
+  %fcmp = fcmp one float %arg, 0x3FCF5C2900000000
+  %select = select i1 %fcmp, float %arg1, float 0x3FCF5C2900000000
+  ret float %select
+}
+
+define i32 @i32_eq_v_i(i32 %arg, i32 %arg1) {
+; GFX9-LABEL: i32_eq_v_i:
+; GFX9:       ; %bb.0: ; %bb
+; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT:    s_mov_b32 s4, 0x67932
+; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, s4, v0
+; GFX9-NEXT:    v_cndmask_b32_e32 v0, v0, v1, vcc
+; GFX9-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX10-LABEL: i32_eq_v_i:
+; GFX10:       ; %bb.0: ; %bb
+; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 0x67932, v0
+; GFX10-NEXT:    v_cndmask_b32_e32 v0, 0x67932, v1, vcc_lo
+; GFX10-NEXT:    s_setpc_b64 s[30:31]
+bb:
+  %icmp = icmp eq i32 %arg, 424242
+  %select = select i1 %icmp, i32 424242, i32 %arg1
+  ret i32 %select
+}
+
+define i32 @i32_eq_i_v(i32 %arg, i32 %arg1) {
+; GFX9-LABEL: i32_eq_i_v:
+; GFX9:       ; %bb.0: ; %bb
+; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT:    s_mov_b32 s4, 0x67932
+; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, s4, v0
+; GFX9-NEXT:    v_cndmask_b32_e32 v0, v0, v1, vcc
+; GFX9-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX10-LABEL: i32_eq_i_v:
+; GFX10:       ; %bb.0: ; %bb
+; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 0x67932, v0
+; GFX10-NEXT:    v_cndmask_b32_e32 v0, 0x67932, v1, vcc_lo
+; GFX10-NEXT:    s_setpc_b64 s[30:31]
+bb:
+  %icmp = icmp eq i32 424242, %arg
+  %select = select i1 %icmp, i32 424242, i32 %arg1
+  ret i32 %select
+}
+
+define i32 @i32_ne_v_i(i32 %arg, i32 %arg1) {
+; GFX9-LABEL: i32_ne_v_i:
+; GFX9:       ; %bb.0: ; %bb
+; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT:    s_mov_b32 s4, 0x67932
+; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, s4, v0
+; GFX9-NEXT:    v_cndmask_b32_e32 v0, v0, v1, vcc
+; GFX9-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX10-LABEL: i32_ne_v_i:
+; GFX10:       ; %bb.0: ; %bb
+; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 0x67932, v0
+; GFX10-NEXT:    v_cndmask_b32_e32 v0, 0x67932, v1, vcc_lo
+; GFX10-NEXT:    s_setpc_b64 s[30:31]
+bb:
+  %icmp = icmp ne i32 %arg, 424242
+  %select = select i1 %icmp, i32 %arg1, i32 424242
+  ret i32 %select
+}
+
+define i32 @i32_ne_i_v(i32 %arg, i32 %arg1) {
+; GFX9-LABEL: i32_ne_i_v:
+; GFX9:       ; %bb.0: ; %bb
+; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT:    s_mov_b32 s4, 0x67932
+; GFX9-NEXT:    v_cmp_ne_u32_e32 vcc, s4, v0
+; GFX9-NEXT:    v_cndmask_b32_e32 v0, v0, v1, vcc
+; GFX9-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX10-LABEL: i32_ne_i_v:
+; GFX10:       ; %bb.0: ; %bb
+; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-NEXT:    v_cmp_ne_u32_e32 vcc_lo, 0x67932, v0
+; GFX10-NEXT:    v_cndmask_b32_e32 v0, 0x67932, v1, vcc_lo
+; GFX10-NEXT:    s_setpc_b64 s[30:31]
+bb:
+  %icmp = icmp ne i32 424242, %arg
+  %select = select i1 %icmp, i32 %arg1, i32 424242
+  ret i32 %select
+}

Contributor

@arsenm arsenm left a comment

Can we do this earlier, before selection? The same problem exists for the scalar case

%icmp = icmp ne i32 424242, %arg
%select = select i1 %icmp, i32 %arg1, i32 424242
ret i32 %select
}
Contributor

Test half, i16, and bfloat cases. Plus 64-bit

Collaborator Author

Test half, i16, and bfloat cases. Plus 64-bit

I added half and i16 types. For the bfloat case we'll get no advantage here, since the immediate ends up in two registers either way, one (shifted left) for the compare and a second (the original) for the cndmask, like:

v_lshlrev_b32_e32 v2, 16, v0
s_mov_b32 s4, 0x42420000
v_cmp_eq_f32_e32 vcc, s4, v2
v_cndmask_b32_e32 v0, v1, v0, vcc

For the 64-bit types this folding doesn't work yet, since they are lowered into a different cmp/cndmask pattern with pairs of registers (with REG_SEQUENCEs). I would implement it incrementally in an additional patch.

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck %s -check-prefix=GFX9
; RUN: llc -mtriple=amdgcn -mcpu=gfx1030 -verify-machineinstrs < %s | FileCheck %s -check-prefix=GFX10

Contributor

What about with inline immediate? It shouldn't be any better?

Collaborator Author

For those cases the immediate is already inlined into v_cndmask when possible. E.g. in the test it is successfully inlined for gfx1030 before the patch processes the v_cndmask; in such cases the patch just skips the instruction.

@dfukalov
Collaborator Author

Can we do this earlier, before selection? The same problem exists for the scalar case

I tried to detect these cases and revert the optimisation in amdgpu-codegenprepare, but realised that in some cases (e.g. when the immediate can be encoded into the instruction, depending on the target and the constant) it can increase register usage (as well as instruction count).

@jayfoad
Contributor

jayfoad commented Mar 13, 2025

Is this going to match X == +0.0 ? X : Y --> X == +0.0 ? +0.0 : Y? That would be wrong, because if X is -0.0 then the first one evaluates to -0.0 and the second one evaluates to +0.0.
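A minimal standalone C++ illustration of this signed-zero corner case (editorial, not from the patch or its tests):

#include <cstdio>

int main() {
  float X = -0.0f;
  // IEEE-754: -0.0 compares equal to +0.0, so the two select forms diverge.
  float A = (X == 0.0f) ? X : 1.0f;    // original form: yields -0.0
  float B = (X == 0.0f) ? 0.0f : 1.0f; // transformed form: yields +0.0
  std::printf("%g %g\n", A, B);        // prints "-0 0"
}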

@arsenm
Contributor

arsenm commented Mar 13, 2025

I tried to detect these cases and revert the optimisation in amdgpu-codegenprepare, but realised that in some cases (e.g. when the immediate can be encoded into the instruction, depending on the target and the constant) it can increase register usage (as well as instruction count).

That would be too late, plenty of compare and select patterns appear during legalization

@dfukalov
Collaborator Author

Is this going to match X == +0.0 ? X : Y -> X == +0.0 ? +0.0 : Y? That would be wrong, because if X is -0.0 then the first one evaluates to -0.0 and the second one evaluates to +0.0.

The X == Const ? X : Y -> X == Const ? Const : Y optimisation is performed by InstCombine on LLVM IR. Here I try to deal with the cases where it can degrade performance.

I double-checked the X == +0.0 ? X : Y case: InstCombine doesn't touch such operations and doesn't transform it into X == +0.0 ? +0.0 : Y, so there is no such issue.

@dfukalov
Collaborator Author

I tried to detect these cases and revert the optimisation in amdgpu-codegenprepare, but realised that in some cases (e.g. when the immediate can be encoded into the instruction, depending on the target and the constant) it can increase register usage (as well as instruction count).

That would be too late, plenty of compare and select patterns appear during legalization

I guess the exact pattern with the same immediate in two registers in cmp/select is not very common. Moreover, when we do get it at the folding stage, it is still profitable to spare a register.

@jayfoad
Contributor

jayfoad commented Mar 18, 2025

Is this going to match X == +0.0 ? X : Y -> X == +0.0 ? +0.0 : Y? That would be wrong, because if X is -0.0 then the first one evaluates to -0.0 and the second one evaluates to +0.0.

The X == Const ? X : Y -> X == Const ? Const : Y optimisation is performed by InstCombine on LLVM IR. Here I try to deal with the cases where it can degrade performance.

I double-checked the X == +0.0 ? X : Y case: InstCombine doesn't touch such operations and doesn't transform it into X == +0.0 ? +0.0 : Y, so there is no such issue.

OK, so InstCombine is not broken, but this patch will try to do the equivalent transformation, so this patch is broken.

@dfukalov
Collaborator Author

Is this going to match X == +0.0 ? X : Y -> X == +0.0 ? +0.0 : Y? That would be wrong, because if X is -0.0 then the first one evaluates to -0.0 and the second one evaluates to +0.0.

The X == Const ? X : Y -> X == Const ? Const : Y optimisation is performed by InstCombine on LLVM IR. Here I try to deal with the cases where it can degrade performance.
I double-checked the X == +0.0 ? X : Y case: InstCombine doesn't touch such operations and doesn't transform it into X == +0.0 ? +0.0 : Y, so there is no such issue.

OK, so InstCombine is not broken, but this patch will try to do the equivalent transformation, so this patch is broken.

This patch deals with the result of InstCombine, which has started to optimize more cases, and tries to revert this transform at the machine IR stage (in terms of the ternary operation, X == Const ? Const : Y -> X == Const ? X : Y).

I've just added tests to check this case:
For the X == +0.0 ? X : Y case InstCombine doesn't transform it, so the patch doesn't process such a cndmask instruction.
For the X == +0.0 ? +0.0 : Y case InstCombine also doesn't transform it, but the imm 0.0 is encoded into the cmp and cndmask instructions before the patch runs, so it just ignores the case.

@jayfoad
Contributor

jayfoad commented Mar 19, 2025

Is this going to match X == +0.0 ? X : Y -> X == +0.0 ? +0.0 : Y? That would be wrong, because if X is -0.0 then the first one evaluates to -0.0 and the second one evaluates to +0.0.

The X == Const ? X : Y -> X == Const ? Const : Y optimisation is performed by InstCombine on LLVM IR. Here I try to deal with the cases where it can degrade performance.
I double-checked the X == +0.0 ? X : Y case: InstCombine doesn't touch such operations and doesn't transform it into X == +0.0 ? +0.0 : Y, so there is no such issue.

OK, so InstCombine is not broken, but this patch will try to do the equivalent transformation, so this patch is broken.

This patch deals with the result of InstCombine, which has started to optimize more cases, and tries to revert this transform at the machine IR stage (in terms of the ternary operation, X == Const ? Const : Y -> X == Const ? X : Y).

I've just added tests to check this case: for the X == +0.0 ? X : Y case InstCombine doesn't transform it, so the patch doesn't process such a cndmask instruction. For the X == +0.0 ? +0.0 : Y case InstCombine also doesn't transform it, but the imm 0.0 is encoded into the cmp and cndmask instructions before the patch runs, so it just ignores the case.

This pass should handle all MIR correctly, not just the MIR that we currently see as a result of how InstCombine behaves today.

@dfukalov
Collaborator Author

This pass should handle all MIR correctly, not just the MIR that we currently see as a result of how InstCombine behaves today.

Actually the pass doesn't rely on the result of InstCombine only, but checks the pattern around a v_cndmask instruction like:

%10:sgpr_32 = S_MOV_B32 1048240456
%11:sreg_64_xexec = nofpexcept V_CMP_NEQ_F32_e64 0, %8:vgpr_32, 0, %10:sgpr_32, 0, implicit $mode, implicit $exec
%13:vgpr_32 = V_MOV_B32_e32 1048240456, implicit $exec
%12:vgpr_32 = V_CNDMASK_B32_e64 0, %13:vgpr_32, 0, %9:vgpr_32, killed %11:sreg_64_xexec, implicit $exec

For both non-mask parameters of the instruction it:

  1. checks whether the parameter is actually a register that contains an immediate
  2. gets the def of the mask parameter
  3. if it is one of the supported compare operations, checks that it is actually a comparison with the same immediate held in a register
  4. depending on the EQ or NEQ compare, reuses this register in the cndmask instruction, so in the previous pattern %8 is used instead of %13:
%10:sgpr_32 = S_MOV_B32 1048240456
%11:sreg_64_xexec = nofpexcept V_CMP_NEQ_F32_e64 0, %8:vgpr_32, 0, %10:sgpr_32, 0, implicit $mode, implicit $exec
%13:vgpr_32 = V_MOV_B32_e32 1048240456, implicit $exec
%12:vgpr_32 = V_CNDMASK_B32_e64 0, %8:vgpr_32, 0, %9:vgpr_32, killed %11:sreg_64_xexec, implicit $exec

Or, did you mean the test should be reworked into MIR tests?

Contributor

@jayfoad jayfoad left a comment

Here is a test case that your patch will break because it does not handle +0/-0 correctly.

define float @f32_oeq_v_negzero(float %arg, float %arg1) {
bb:
  %fcmp = fcmp oeq float %arg, 0x8000000000000000
  %select = select i1 %fcmp, float 0x8000000000000000, float %arg1
  ret float %select
}

@dfukalov
Collaborator Author

Here is a test case that your patch will break because it does not handle +0/-0 correctly.

Yes, you're right, thanks!
It now checks for a +0/-0 floating-point immediate and doesn't reuse its register in that case.
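A rough sketch of the kind of guard this implies, assuming the 32-bit immediate is inspected as a raw bit pattern; the helper name and shape are illustrative, not the actual patch code:

#include <cstdint>

// Both +0.0f (0x00000000) and -0.0f (0x80000000) must block the fold,
// since replacing the constant operand with X could flip the sign of zero.
static bool isFP32ZeroImm(uint64_t Imm) {
  return (static_cast<uint32_t>(Imm) & 0x7fffffffu) == 0;
}

int main() {
  // 0x3e7ae148 is the non-zero constant used in the tests and must pass.
  return (isFP32ZeroImm(0x80000000u) && !isFP32ZeroImm(0x3e7ae148u)) ? 0 : 1;
}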

@dfukalov dfukalov requested a review from jayfoad March 25, 2025 16:19
@dfukalov
Collaborator Author

dfukalov commented Apr 9, 2025

Ping...

1 similar comment
@dfukalov
Collaborator Author

Ping...

@dfukalov
Collaborator Author

@jayfoad would you please take a look at the patch?

@dfukalov dfukalov closed this Jul 14, 2025
dfukalov added a commit that referenced this pull request Jul 16, 2025
…dmask (#148740)

For some targets, the optimization X == Const ? X : Y -> X == Const ?
Const : Y can cause extra register usage or redundant immediate encoding
for the constant in cndmask generated from the ternary operation.

This patch detects such cases and reuses the register from the compare
instruction that already holds the constant, instead of materializing it
again for cndmask.

The optimization avoids immediates that can be encoded into the cndmask
instruction (including +-0.0), as well as !isNormal() constants.

The change is reworked on the basis of #131146

---------

Co-authored-by: Copilot <[email protected]>